<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
	<title>多词同义词和短语查询 | Elasticsearch: 权威指南 | Elastic</title>
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
</head>
<body>
<div class="main-container">
    <section id="content">
        
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">原文地址: <a href="https://www.elastic.co/guide/cn/elasticsearch/guide/current/multi-word-synonyms.html" rel="nofollow">https://www.elastic.co/guide/cn/elasticsearch/guide/current/multi-word-synonyms.html</a>, 版权归 www.elastic.co 所有<br/>
                            英文版地址: <a href="https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html" rel="nofollow">https://www.elastic.co/guide/en/elasticsearch/guide/current/multi-word-synonyms.html</a>
                            </div>
                        <!-- start body -->
                  <div class="page_header">
<b>请注意:</b><br>本书基于 Elasticsearch 2.x 版本，有些内容可能已经过时。
</div>
<div id="content">
<div class="breadcrumbs">
<span class="breadcrumb-link"><a href="index.html">Elasticsearch: 权威指南</a></span>
»
<span class="breadcrumb-link"><a href="languages.html">处理人类语言</a></span>
»
<span class="breadcrumb-link"><a href="synonyms.html">同义词</a></span>
»
<span class="breadcrumb-node">多词同义词和短语查询</span>
</div>
<div class="navheader">
<span class="prev">
<a href="synonyms-analysis-chain.html">« 同义词和分析链</a>
</span>
<span class="next">
<a href="symbol-synonyms.html">符号同义词 »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="multi-word-synonyms"></a>多词同义词和短语查询 (Multiword Synonyms and Phrase Queries)<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/260_Synonyms/60_Multi_word_synonyms.asciidoc">edit</a>
</h2>
</div></div></div>
<p>至此，同义词看上去还挺简单的。然而不幸的是，复杂的部分才刚刚开始。 为了能使 <a class="xref" href="phrase-matching.html" title="短语匹配">短语查询</a> 正常工作，
Elasticsearch 需要知道每个词在原文中的位置。多词同义词会严重破坏词的位置信息，尤其当新增的同义词长度各不相同时。</p>
<p>我们创建一个同义词(synonym)语汇单元过滤器(token filter)，然后使用下面这样的同义词规则：</p>
<pre class="literallayout">"usa,united states,u s a,united states of america"</pre>

<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "usa,united states,u s a,united states of america"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_synonyms&amp;text=
The United States is wealthy
</pre>
</div>
<p><code class="literal">analyze</code>API 会输出下面这样的结果：</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">Pos 1:  (the)
Pos 2:  (usa,united,u,united)
Pos 3:  (states,s,states)
Pos 4:  (is,a,of)
Pos 5:  (wealthy,america)</pre>
</div>
<p>如果你用上面这个同义词语汇单元过滤器索引一个文档，然后执行一个短语查询，那你就会得到惊人的结果，下面这些短语都不匹配：</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
The usa is wealthy
</li>
<li class="listitem">
The united states of america is wealthy
</li>
<li class="listitem">
The U.S.A. is wealthy
</li>
</ul>
</div>
<p>但是这些短语却匹配：</p>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
United states is wealthy
</li>
<li class="listitem">
Usa states of wealthy
</li>
<li class="listitem">
The U.S. of wealthy
</li>
<li class="listitem">
U.S. is america
</li>
</ul>
</div>
<p>如果你是在查询阶段使用同义词，那你就会看到更加诡异的匹配结果。看下这个 <code class="literal">validate-query</code> 请求：</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">GET /my_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "text": {
        "query": "usa is wealthy",
        "analyzer": "my_synonyms"
      }
    }
  }
}</pre>
</div>
<p>查询关键字会被同义词语汇单元过滤器处理成类似这样的信息：</p>
<pre class="literallayout">"(usa united u united) (is states s states) (wealthy a of) america"</pre>

<p>这会匹配包含有 <code class="word">u is of america</code> 的文档，但是匹配不出任何含有 <code class="word">america</code> 的文档。</p>
<div class="tip admon">
<div class="icon"></div>
<div class="admon_content">
<p>多词同义对高亮匹配结果也会造成影响。一个针对 <code class="word">USA</code> 的查询，返回的结果可能却高亮了： The <em>United States is wealthy</em>  。</p>
</div>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="_使用简单收缩进行短语查询"></a>使用简单收缩进行短语查询<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/260_Synonyms/60_Multi_word_synonyms.asciidoc">edit</a>
</h3>
</div></div></div>
<p>避免这种混乱的方法是使用 <a class="xref" href="synonyms-expand-or-contract.html#synonyms-contraction" title="简单收缩">简单收缩</a>，
用单个词项表示所有的同义词，
然后在查询阶段，就只需要针对这单个词进行查询了：</p>
<div class="pre_wrapper lang-json">
<pre class="programlisting prettyprint lang-json">PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "united states,u s a,united states of america=&gt;usa"
          ]
        }
      },
      "analyzer": {
        "my_synonyms": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "my_synonym_filter"
          ]
        }
      }
    }
  }
}

GET /my_index/_analyze?analyzer=my_synonyms
The United States is wealthy</pre>
</div>
<p>上面那个查询信息就会被处理成类似下面这样：</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">Pos 1:  (the)
Pos 2:  (usa)
Pos 3:  (is)
Pos 5:  (wealthy)</pre>
</div>
<p>现在我们再次执行我们之前做过的那个 <code class="literal">validate-query</code> 查询，就会输出一个简单又合理的结果：</p>
<pre class="literallayout">"usa is wealthy"</pre>

<p>这个方法的缺点是，因为把 <code class="word">united states of america</code> 转换成了同义词 <code class="word">usa</code>, 你就不能使用 <code class="word">united states of america</code> 去搜索出 <code class="word">united</code> 或者 <code class="word">states</code> 。
你需要使用一个额外的字段并用另一个解析器链(analysis chain)来达到这个目的。</p>
</div>

<div class="section">
<div class="titlepage"><div><div>
<h3 class="title">
<a id="_同义词与_query_string_查询"></a>同义词与 query_string 查询<a class="edit_me edit_me_private" rel="nofollow" title="Editing on GitHub is available to Elastic" href="https://github.com/elasticsearch-cn/elasticsearch-definitive-guide/edit/cn/260_Synonyms/60_Multi_word_synonyms.asciidoc">edit</a>
</h3>
</div></div></div>
<p>本书很少谈论到 <code class="literal">query_string</code> 查询，因为真心不推荐你用它。
在 <a class="xref" href="search-lite.html#query-string-query" title="更复杂的查询">复杂查询</a> 一节中有提到，由于 <code class="literal">query_string</code> 查询支持一个精简的 <em>查询语法</em> ，它经常会导致意想不到的结果，甚至语法错误。</p>
<p>这种查询方式存在不少问题，而其中之一便与多词同义有关。为了支持它的查询语法，你必须用指定的、该语法所能识别的操作符号来标示，比如 <code class="literal">AND</code> 、 <code class="literal">OR</code> 、 <code class="literal">+</code> 、 <code class="literal">-</code> 、 <code class="literal">field:</code> 等等。
(更多相关内容参阅 <a href="https://www.elastic.co/guide/en/elasticsearch/reference/5.6/query-dsl-query-string-query.html#query-string-syntax" class="ulink" target="_top"><code class="literal">query_string</code> 语法</a> 。)</p>
<p>而在这种语法的解析过程中，解析动作会把查询文本在空格符处作切分，然后分别把每个切分出来的词传递给相关性解析器。
这也即意味着你的同义词解析器永远都不可能收到类似 <code class="word">United States</code> 这样的多个单词组成的同义词。由于不会把 <code class="word">United States</code> 作为一个原子性的文本，所以同义词解析器的输入信息永远都是两个被切分开的词 <code class="word">United</code> 和 <code class="word">States</code> 。</p>
<p>所幸， <code class="literal">match</code> 查询相比而言就可靠得多了，因为它不支持上述语法，所以多个字组成的同义词不会被切分开，而是会完整地交给解析器处理。</p>
</div>

</div>
<div class="navfooter">
<span class="prev">
<a href="synonyms-analysis-chain.html">« 同义词和分析链</a>
</span>
<span class="next">
<a href="symbol-synonyms.html">符号同义词 »</a>
</span>
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>